Abstract

Background: Saccharomyces cerevisiae is recognized as a model system representing a simple eukaryote whose
genome can be easily manipulated. The information that scientists need on its biological entities (proteins, genes,
RNAs...) is scattered across several data sources such as SGD, Yeastract, CYGD-MIPS, BioGrid and PhosphoGrid. Because
of the heterogeneity of these sources, querying them separately and then manually combining the returned results
is a complex and time-consuming task for biologists, most of whom are not bioinformatics experts. It also
limits the use that can be made of the available data.

Results: To provide transparent and simultaneous access to yeast sources, we have developed YeastMed: an XML
and mediator-based system. In this paper, we present our approach in developing this system which takes 
advantage of SB-KOM to perform the query transformation needed and a set of Data Services to reach the 
integrated data sources. The system is composed of a set of modules that depend heavily on XML and Semantic 
Web technologies. User queries are expressed in terms of a domain ontology through a simple form-based web 
interface. 

Conclusions: YeastMed is the first mediation-based system specifically designed to integrate yeast data sources. It was
conceived mainly to help biologists find relevant data from multiple data sources simultaneously. It has an
easy-to-use, biologist-friendly interface. The system is available at http://www.khaos.uma.es/yeastmed/.



Background 

The yeast Saccharomyces cerevisiae, known popularly as 
bakers' or brewers' yeast, has been used extensively in 
aging research. It is a unicellular organism whose DNA 
is packaged into chromosomes that are localized in a 
subcellular structure called the nucleus. Since 1990, it 
has emerged as an important model organism for the 
dissection of the biological aging process at the genetic 
and molecular levels. Saccharomyces cerevisiae was the
first eukaryote whose genome was completely sequenced [1].

Nowadays, the word yeast is widely given to the spe- 
cies Saccharomyces cerevisiae because of the place it 
occupies in biological research. Large amounts of data 
related to it are generated by life science research and stored in
multiple databases. Biologists routinely need
to query these sources in order to analyse the results of
their experiments. They usually perform the following 
tasks during query formulation and execution: (i) look 
for appropriate sources where it is possible to find help- 
ful data and specify their location, (ii) identify the focus 
of each source, (iii) query each convenient source inde- 
pendently using its specific access method and query 
language, (iv) navigate through the sources to obtain
complementary data, and (v) manually merge the results
obtained from different sources. This is a tedious and
time-consuming task for biologists, most of whom are
not bioinformatics experts, and it reduces the benefit
that can be drawn from the available information.

The challenge of modern bioinformatics research is
not only storing data in repositories, but also processing
and integrating them. Multiple solutions to biological
data integration have been developed. Two common
approaches have been used to make biological databases
interoperate: the data warehousing (or materialised)
approach [2] and the federated/mediator-based (or virtual)
approach [3].

The data warehousing approach is adopted by numerous
biological integration systems such as GUS [4], Atlas
[5], BioSQL [6], BioMart [7], BioWarehouse [8], and
Chado [9]. This approach uses a data warehouse repository
that provides a single access point to a collection of
data obtained from a set of distributed, heterogeneous
sources. Data from the remote heterogeneous databases
are copied onto a local server, and the user issues
multi-database queries through a single interface. Data
warehousing requires the use of Extraction, Transformation
and Load (ETL) [10] tools to load data and map them
to a materialised global schema: all the data loaded
from the sources must be converted through data mapping
to a single standard format before being physically stored
locally. Relying less on the network to access the data
helps to eliminate various problems such as network
bottlenecks, slow response times, and the occasional
unavailability of sources. Furthermore, using materialised
warehouses allows for more efficient query optimisation,
as it can be performed locally [11,12]. Another benefit
of the data warehouse integration approach is that it
allows the system to filter, validate, modify, and annotate
the data obtained from the heterogeneous sources,
which has been noted as a very attractive property for
bioinformatics. This approach, however, has an important
and costly drawback in terms of reliability of results and
overall system maintenance: it may return outdated results.
A warehouse must regularly check all the underlying
sources for new or updated data and then reflect those
modifications in the local copy of the data [12].

Virtual integration (mainly mediator-based systems)
concentrates on query rewriting: it rewrites the user
query into queries that are understood by the integrated
sources. The mediator uses the relationships between
the sources and a global schema to translate queries on the
mediator schema into queries on the data source schemata.
The two main approaches for establishing the mappings between
each source schema and the integration schema are
global-as-view (GAV) and local-as-view (LAV) [11,13]. In
the GAV approach the mediator relations are directly
written in terms of the source relations. The GAV
approach greatly facilitates query reformulation, as it
simply becomes a view unfolding process. In the LAV
approach every source relation is defined over the
relations and the schema of the mediator. It is therefore up
to the individual sources to provide a description of
their schema in terms of the global schema, which makes it
very simple to add or remove sources but also complicates
the query reformulation and processing role of the
mediator.
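To illustrate why query reformulation under GAV reduces to view unfolding, the following minimal Python sketch replaces each mediator atom of a conjunctive query by the source atoms that define it. The source relation names (sgd_gene, yeastract_regulation) and the mapping table are hypothetical and are not taken from any of the systems discussed here; the sketch also assumes that view bodies contain only variables.

```python
# Hypothetical GAV mappings: each mediator relation is a view defined as a
# conjunction of source relations. Format: name -> (head variables, body).
GAV_VIEWS = {
    "Gene": (("G",), [("sgd_gene", ("G", "Name"))]),
    "regulatedBy": (("G", "TF"), [("yeastract_regulation", ("TF", "G"))]),
}


def unfold_atom(atom, views, suffix):
    """Replace one mediator atom by the source atoms of its view."""
    predicate, args = atom
    head_vars, body = views[predicate]
    # Bind the head variables of the view to the arguments of the query atom.
    binding = dict(zip(head_vars, args))
    unfolded = []
    for source_predicate, source_args in body:
        # Variables that occur only in the view body get a unique suffix so
        # that separate unfoldings never share them by accident.
        new_args = tuple(binding.get(v, f"{v}_{suffix}") for v in source_args)
        unfolded.append((source_predicate, new_args))
    return unfolded


def unfold_query(query_atoms, views):
    """Unfold every atom of a conjunctive query over the mediator schema."""
    result = []
    for position, atom in enumerate(query_atoms, start=1):
        result.extend(unfold_atom(atom, views, position))
    return result


# Mediator-level query: genes X regulated by transcription factor T.
query = [("Gene", ("X",)), ("regulatedBy", ("X", "T"))]
source_query = unfold_query(query, GAV_VIEWS)
print(source_query)
```

Under LAV no such direct substitution is available, because the mappings describe sources in terms of the mediator schema rather than the other way round; this is why reformulation is harder in that setting.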

The mediator-based approach has several strengths
compared to data warehousing. It does not have the
updating problem, as the query goes directly to the original
source. Mediators can be seen as a cheaper and
more effective approach since they use schema or view
integration, rather than requiring huge storage
capacity to store data copied from all the involved data
sources.

This paper presents a mediator-based system called 
YeastMed [14] that aims to provide transparent access 
to disparate biological databases of yeast. It provides a 
unique interface between the user who submits a query, 
and a set of five data sources accessible via web proto- 
cols. YeastMed relies on SB-KOM [15] to perform the 
query transformation needed to reach the integrated 
data sources. These sources are: SGD [16], Yeastract 
[17], CYGD-MIPS [18], BioGrid [19] and PhosphoGrid 
[20]. They provide complementary data on biological 
entities (cellular interaction, metabolic pathways, tran- 
scription factors, annotation data...). With YeastMed, we 
aim to help biologists to get relevant data to understand 
and explain the biological processes of interest by using 
an integrative system. 

This paper is organised as follows: an overview of
some biological data integration systems is given in the
next section. Then, a general overview of the system
and the resources used in YeastMed is given before
describing the integration process components along with
some explanatory schemas. A detailed use case is then
sketched, describing how YeastMed proceeds when a
user query is submitted. Finally, we discuss some
advantages and limitations of the current version of
YeastMed before concluding the paper.

Related work 

Work specific to the integration of yeast data sources
is scarce. However, a variety of data integration
systems tailored to bioinformatics
applications have been developed. These systems can
broadly be classified as data warehouses,
federated/mediator-based systems and XML-based systems.
Data warehouse systems

Several attempts have been made to create integrated 
environments for storing and analysing biological data. 
For the sake of brevity, we sketch here two of them due 
to their relation to yeast. 

Cell Cycle Database [21] is an integrated data ware- 
house for systems biology modelling and cell cycle ana- 
lysis based on yeast and mammalian organisms. The 
system integrates information about genes and proteins 
involved in the cell cycle process. It stores complete 
models of interaction networks and allows the mathe- 
matical simulation over time of the quantitative beha- 
viour of each component. The database integration 
system consists of a series of programs used to retrieve 
the data from several different external databases,
transform them and load them into the warehouse data model.

YeastHub [22] is a prototype application in which a 
data warehouse has been constructed in order to store 
and query different types of yeast genome data provided 
by different resources in different formats including the 
tabular and RDF formats. Once the data are loaded into 
the data warehouse, RDF-based queries can be formu- 
lated to retrieve and query the data in an integrated fash- 
ion. YeastHub is implemented using Sesame 1.1 [23]. 
The tabular-to-RDF conversion is written using Java. 

These two systems present some limitations: they
store the extracted data locally in a data warehouse or
database, which makes the updating process a tedious
task. YeastHub presents another problem: Sesame
does not have a way to identify the source of the triples
(statements) once they are loaded into the repository. In
contrast, YeastMed queries data in their
original sources and lets the user
choose which data source entry to return.
If the user does not make a choice, the system explicitly
gives the provenance of the result entries.
Federated and mediator-based systems 
Other alternative solutions have been proposed in biolo- 
gical data integration adopting a virtual approach. 
Among them we can cite: 

Kleisli [4] is a mediator system encompassing a
nested relational data model, a high-level query lan- 
guage, and a powerful query optimiser. It runs on top of 
a large number of light-weight wrappers for accessing 
various data sources. The Kleisli system is highly exten- 
sible. It can be used to support several high-level query 
languages by replacing its high-level query language 
module. Kleisli supports the Collection Programming 
Language (CPL) [24] and a nested relational version of 
SQL. However, Kleisli does not use any global schema or
ontology over which a user can formulate queries. A
query attribute is bound to a matched attribute in a single
source, so there is no integration across different
sources.



DiscoveryLink [25] is a wrapper-oriented bioinfor- 
matics integration system built on the Garlic project 
technology [26]. It serves as a middleware between the 
applications and a set of wrappers. Applications connect 
to DiscoveryLink and submit an SQL query on its global 
schema. The wrappers provide source-specific informa- 
tion about query capabilities that help the optimiser to 
determine which parts of a query can be submitted to 
each source. The query optimiser considers the speed of 
various sources, their network connections, and the size 
of their data to predict the costs of different plans.
DiscoveryLink, however, cannot deal with complex 
source data such as nested data. Most biological data, 
unfortunately, are highly nested. Therefore, there is a 
significant amount of mismatch between most data 
sources and DiscoveryLink. Furthermore, it is hard to 
add new data sources or analysis tools to DiscoveryLink. 
In addition, DiscoveryLink requires SQL as its query 
language, which is not easy for biologists to write. 

TAMBIS [27] is a mediator-based and ontology-driven
integration system with three layers: the conceptual
model, the mapping model and the physical model. In
TAMBIS, queries are formulated through a
graphical interface where the user needs to browse through
the different concepts defined in the global schema and
select the suitable ones for a particular query. As the first
step, the system expresses the graphical query in GRAIL 
[28]. Then, the query is translated into a Query Internal 
Form (QIF), which is in turn translated into a source- 
dependent query execution plan in CPL [24]. The global 
ontology is a unified conceptual-level representation of 
its registered component resources. It provides a global 
schema as well as an abstract framework for relating, 
reconciling, and coordinating the concepts in the 
sources. The mapping model converts a query phrased 
in terms of the conceptual layer into executable plans in 
terms of each source. The physical model submits the 
executable plans to different sources and retrieves the 
results. Although TAMBIS is more of an upper-level
solution than other systems, its graphical interface is
very complicated and requires that the user understand
the query language.

BioMediator [29] is a federated data
integration system based on XML. It uses a mediated
schema which allows for more flexible data modelling.
The central component of the BioMediator system is its
source knowledge base, which consists of descriptions of
the various data sources, mappings from the sources to
the mediated schema, and the mediated schema itself.
The system also includes wrappers that conduct syntactic
translations by translating the returned results into
an XML document, a metawrapper that conducts
semantic translations by mapping the returned XML
document onto the mediated schema, and a query processor
that runs queries (written in XQuery) against the
mediated schema. BioMediator is thus aimed at
users who know the XQuery language and is not intended
to be used by external research groups.

Compared to these systems, YeastMed is the first data 
integration system which adopts the mediation approach 
to integrate yeast-specific data sources. It has a domain 
ontology which plays the role of the global schema and 
supports the user queries. Unlike the systems cited
above, YeastMed has an easy-to-use, ontology-driven
interface where users express their requests in simple
natural language. Users do not need to know a specific
query language to use it. In addition, thanks to its modular
design, YeastMed makes it easy to add
new data sources or analysis tools.
XML-based systems 

Despite the possibility to use standard approaches for 
data integration [30], specific approaches based on the 
employment of XML in Bioinformatics have been 
proposed: 

AutoMed [31] is a heterogeneous data transformation
and integration system which offers the capability of
handling virtual, materialised and hybrid data
transformation/integration across multiple data models.
AutoMed uses XML DataSource Schema (XMLDSS) as a
common representation language and schema type sup- 
porting the annotations for each source by suitable 
ontologies. An XMLDSS schema can be automatically 
extracted from an XML document or automatically 
derived from an accompanying DTD/XML schema if 
one is available. 

The system approach is based on: (i) XML as a com- 
mon representation format; (ii) XMLDSS as the schema 
type for the XML documents input to and output by 
services; (iii) Correspondence to available ontologies; i.e. 
the services inputs/outputs are annotated with corre- 
spondences between the XMLDSS schema and some 
existing ontologies; and (iv) the AutoMed toolkit, used to
automatically transform the XMLDSS schema of the output of a
given service into the XMLDSS schema of the input of
another service.

SWAMI [32] defines a rich middleware architecture
that integrates different databases, formats and computational
resources. Its architecture includes a Presentation
layer that receives user requests, passes them
to the Core workbench Application, and returns application
results to the user by the same route. The Core Application
consists of four major components. The User
module receives data and instructions from the
Presentation layer. The Broker module interacts
with the other modules via APIs and serves as a coordinator,
using a registry service that maintains information
about all available services and databases. The
Tool and Data modules, which are conceptually identical,
abstract applications and databases respectively, and
perform their functions by orchestrating a series of services.
XML is used for the declarative specification of
services.

Methods 

YeastMed is a mediator-based system that consists of
several components contributing to the data integration
process in different ways. In this section we describe in detail
how the system was built, presenting
its components and the role of each of
them.

YeastMed overview 

The general architecture of the YeastMed system is
shown in Figure 1. It consists of a set of components
that have been implemented independently and play dif- 
ferent roles. The access point to the system is a web 
interface that furnishes two search forms: 

■ A quick search form where scientists can quickly 
submit their requests based on some keywords 
(Gene or Protein names, GO terms or any other 
words that can appear in the search fields of the 
interrogated data sources). This type of search does 
not make use of the mediator. It exploits the 
YeastMed web services to look for information in 
the integrated data sources. 

■ An ontology-driven search form which allows biol- 
ogists to express their requests in terms of the 
YeastMed Ontology. These terms are presented in
natural language to ease the query formulation pro- 
cess for biologists most of whom are not familiar 
with knowledge representation and query languages. 

YeastMed relies on SB-KOM [15] to perform query 
transformation at execution time. Once the user submits 
a request from the web interface, YeastMed generates a 
conjunctive query. SB-KOM decomposes this query into 
suitable sub-queries to individual sources based on a set 
of mapping rules. These sub-queries are expressed in 
XQuery, because the sources are accessed through web 
services using this query language. 
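The decomposition step can be pictured with the following minimal Python sketch. It is not SB-KOM's algorithm, which is not detailed in this section; it only assumes that mapping rules tell which source can answer which ontology term, and it routes each atom of a conjunctive query to every such source. The rule table and the query atoms are hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping rules: which source can answer which ontology term.
MAPPING_RULES = {
    "Gene": ["SGD"],
    "TranscriptionFactor": ["YEASTRACT"],
    "regulatedBy": ["YEASTRACT"],
    "hasStandardName": ["SGD", "YEASTRACT"],
}


def decompose(atoms, rules):
    """Group the atoms of a conjunctive query by the sources that can answer them."""
    sub_queries = defaultdict(list)
    for atom in atoms:
        # The predicate is the text before the first opening parenthesis.
        predicate = atom.split("(", 1)[0]
        for source in rules.get(predicate, []):
            sub_queries[source].append(atom)
    return dict(sub_queries)


atoms = [
    "Gene(G)",
    "regulatedBy(G, TF)",
    "TranscriptionFactor(TF)",
    'hasStandardName(TF, "Adr1")',
]
plan = decompose(atoms, MAPPING_RULES)
for source, sub_query in plan.items():
    print(source, sub_query)
```

In a real mediator each group of atoms would then be translated into an XQuery for the corresponding Data Service; that translation is left out of the sketch.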

YeastMed has a set of web services (which we call Data
Services), one for each integrated source. These components
receive XQueries from SB-KOM and return XML documents.
The role of the web services is to allow
YeastMed to use wrapper functionalities to find and
extract the requested information from data sources through
their web pages or FTP mechanisms. The answers to the
XQueries, returned as XML documents, are sent to the
mediator, which combines them into a YeastMed ontology
instance expressed in RDF. The final result is presented
to the user in HTML format. Data sources are
also an important component of the YeastMed
architecture because they are the providers of the biological
information.

Figure 1 General architecture of the YeastMed system. It shows how the different components of the YeastMed system are structured and how they interact
with one another.

Integrated data sources 

In its current version, YeastMed integrates five yeast
databases. They were selected for being well suited to
the study of Saccharomyces cerevisiae:
they provide complementary data concerning
the genome, proteome, metabolome and reactome.
These sources are:

■ SGD Database [16]: It contains the sequences of 
yeast genes and proteins, descriptions and classifica- 
tions of their biological roles, molecular functions, 
subcellular localisations, links to literature informa- 
tion and tools for analysis and comparison of 
sequences. 

■ YEASTRACT Database [17]: It is a repository of
regulatory associations between transcription factors
and target genes, based on experimental evidence
which was spread throughout bibliographic references.
Each regulation has been annotated manually,
after examination of the relevant references. The
database also contains the description of specific
DNA binding sites for a sub-group of transcription
factors.

■ MIPS-CYGD [18]: It aims to present
information on the molecular structure and func- 
tional network of Saccharomyces cerevisiae. In addi- 
tion, the data of various projects on related yeasts 
are also used for comparative analysis. 

■ BioGRID [19]: It is an online interaction reposi- 
tory with data compiled through comprehensive 
curation efforts. All interaction data are freely pro- 
vided through the search index and available via 
download in a wide variety of standardised formats. 

■ PhosphoGRID [20]: It records the positions of specific
phosphorylated residues on gene products.
Where available for specific sites, PhosphoGRID has
also noted the relevant protein kinases and/or phos- 
phatases, the specific condition(s) under which phos- 
phorylation occurs, and the effect(s) that 
phosphorylation has on protein function. 

YeastMed user interface 

Because biologists are familiar with HTML forms
when querying biological databases, and in order to
make YeastMed easy to use, we have adopted the same
strategy that most biological databases use to
receive queries: a simple HTML-form-based interface
has been developed that allows queries to be
expressed in natural language. It is an ontology-driven
interface. Users formulate their queries by selecting
items from the form fields. These items have their
equivalents in the YeastMed ontology (concepts and
properties) and are written in natural language. For
example, the concept BibRef in the ontology is shown
in the form fields as Bibliographic Reference and the
datatype property hasProductDesc as having Product
Description. We believe that it is easy for
users to express their requests in natural language by
implicitly using triplets composed of what ontology
terminology calls domain, property and
range. For example, a user interested in the set of
genes regulated by the transcription factor having the
standard name Adr1 can express it using the two assertions
"Gene regulated by Transcription Factor" and
"Transcription Factor has standard name Adr1". We
have designed the YeastMed interface to
capture these kinds of expressions. The query form
offers three fields per line. Each line represents the
triplet formed by domain, property and range. The range can
be either a concept selected from the third field of the
line or a literal value entered in a field that appears
below the second field when a datatype property
has been selected in it. The example above can be
captured in the YeastMed interface using two field lines as
follows: in the first line "Gene" "regulated by"
"Transcription Factor" and in the second line
"Transcription Factor" "having Standard Name" "Adr1"
(Figure 2). When the form is submitted, the system uses the
ontology equivalents of these items and creates the
conjunctive query Ans(G) := Gene(G), regulatedBy(G, TF),
TranscriptionFactor(TF), hasStandardName(TF, "Adr1")
before sending it to the mediator component.
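The translation from form lines to a conjunctive query can be sketched in Python as follows. This is an illustration under stated assumptions, not the YeastMed implementation: the label tables are hypothetical and contain only the entries needed for the example, each concept gets a single variable built from its capital letters, and the domain of the first line is taken as the answer variable.

```python
# Hypothetical lookup tables from form labels to ontology terms.
CONCEPTS = {"Gene": "Gene", "Transcription Factor": "TranscriptionFactor"}
OBJECT_PROPERTIES = {"regulated by": "regulatedBy"}
DATATYPE_PROPERTIES = {"having Standard Name": "hasStandardName"}


def build_conjunctive_query(form_lines):
    """Turn (domain, property, range) form lines into a conjunctive query."""
    variables = {}  # ontology concept -> variable name
    atoms = []

    def variable_for(label):
        concept = CONCEPTS[label]
        is_new = concept not in variables
        if is_new:
            # Assumption: one variable per concept, named after its capitals.
            variables[concept] = "".join(c for c in concept if c.isupper())
        return concept, variables[concept], is_new

    for domain, prop, range_ in form_lines:
        d_concept, d_var, d_new = variable_for(domain)
        if d_new:
            atoms.append(f"{d_concept}({d_var})")
        if prop in OBJECT_PROPERTIES:
            # The range is a concept chosen in the third field.
            r_concept, r_var, r_new = variable_for(range_)
            atoms.append(f"{OBJECT_PROPERTIES[prop]}({d_var}, {r_var})")
            if r_new:
                atoms.append(f"{r_concept}({r_var})")
        else:
            # The range is a literal typed by the user.
            atoms.append(f'{DATATYPE_PROPERTIES[prop]}({d_var}, "{range_}")')

    answer_var = variables[CONCEPTS[form_lines[0][0]]]
    return f"Ans({answer_var}) := " + ", ".join(atoms)


lines = [
    ("Gene", "regulated by", "Transcription Factor"),
    ("Transcription Factor", "having Standard Name", "Adr1"),
]
query = build_conjunctive_query(lines)
print(query)
```

Because the sketch uses one variable per concept, it cannot express queries that relate two different individuals of the same concept; a fuller version would need explicit variable handling in the form.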

The YeastMed web site also lets users query the five
integrated databases through a quick search form, without
using the mediator. Users simply enter their keywords in
an input field and select the databases to be searched;
the system then uses the YeastMed Data Services to access and
extract data from the underlying sources.

Data integration in YeastMed 

YeastMed has a set of modules that depend heavily on
XML and semantic web technologies to integrate
biological data syntactically and semantically. In what follows,
we give detailed information on these components.
Source schemas 

The knowledge modelling of the application domain of
YeastMed constitutes the cornerstone of an efficient
integration. To that end, a detailed study of the sources
has been carried out with the goal of establishing a standard
terminology to describe the data. Each data source
has been modelled by an exported XML Schema (Figure
3). An exported schema is a source schema translated
into the terms of the YeastMed ontology. These schemas
serve as models describing the data and their organisation
in the data sources, and define the structure in which
results are returned by the Data Services.
Data services 

YeastMed uses a set of web services (called in our case 
Data Services) to access data sources. We have devel- 
oped one Data Service for each integrated yeast source. 
These components hide technical and data model details 
of the data source from the mediator. They receive 
XQueries from SB-KOM and return XML documents in 
addition to other metadata. The role of YeastMed Data
Services is twofold:

■ Allowing YeastMed to use the wrapper functionalities
to find and extract the requested information from
data sources over HTTP or FTP.
This means providing the ability to
evaluate XQueries and return answers in XML format.

■ Exporting semantic information about data schemas
and data provenance. This mainly allows
YeastMed to keep track of the origin of the returned
information when combining results, and to know which
source is being queried.

A wrapper is an interface
for a data source that translates data into the common
data model used by mediators [33]. Because the goal of
YeastMed is to integrate databases accessible via Web
protocols, the wrapper is
the most important component of the architecture
of YeastMed Data Services. It
receives XQueries generated by SB-KOM, accesses a
specific data source, extracts data and translates them
into the common data model used by SB-KOM, i.e.
XML (Figure 4).

Figure 2 The ontology-based search interface of YeastMed. It shows how the example sketched in the user interface section is captured.

In addition to the wrapper's query service, the web
services encapsulate an Application Programming
Interface (API). It is the access point for SB-KOM to the
functionality of the web service. This API publishes
three methods. The first, getQuery(Q), passes the XQuery Q
to the wrapper and returns its answer in XML format.
The XML structure of this answer must satisfy the
constraints of the source schema. The other two methods,
getSchema() and getProvenance(), provide access to
the metadata that the web service stores. The former
returns the XML data schema and the latter provides
information on the underlying data source. SB-KOM finds
all the information needed to use these methods correctly
in a WSDL (Web Services Description Language) document.
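The three-method contract described above can be summarised with a short Python sketch. The actual Data Services are Java web services described by WSDL; the class names, the snake_case method names, the example URL and the canned content below are all hypothetical, and the toy implementation does not evaluate XQuery at all.

```python
from abc import ABC, abstractmethod
import xml.etree.ElementTree as ET


class DataService(ABC):
    """Contract mirroring getQuery, getSchema and getProvenance."""

    @abstractmethod
    def get_query(self, xquery: str) -> str:
        """Pass an XQuery to the wrapper and return an XML document."""

    @abstractmethod
    def get_schema(self) -> str:
        """Return the exported XML schema of the source."""

    @abstractmethod
    def get_provenance(self) -> dict:
        """Return information about the underlying data source."""


class InMemoryDataService(DataService):
    """Toy stand-in that returns every stored record for any query."""

    def __init__(self, name, url, genes):
        self.name = name
        self.url = url
        self.genes = genes

    def get_query(self, xquery):
        # A real wrapper would analyse the XQuery; this toy ignores it.
        root = ET.Element("result", source=self.name)
        for gene in self.genes:
            ET.SubElement(root, "gene").text = gene
        return ET.tostring(root, encoding="unicode")

    def get_schema(self):
        # Placeholder: an empty XML Schema document.
        return '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"/>'

    def get_provenance(self):
        return {"name": self.name, "url": self.url}


service = InMemoryDataService("toy", "http://example.org/toy", ["SWI4", "TOP1"])
answer = service.get_query("for $g in //gene return $g")
print(answer)
print(service.get_provenance())
```

Keeping the contract this small is what lets the mediator treat every source uniformly: it only needs to know how to send a query and how to ask for the schema and the provenance.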

The Data Services have been implemented in Java.
They receive XQueries from SB-KOM via the getQuery()
method of the API, which passes them to the wrapper. The
wrapper is implemented as a set of Java classes that define
several methods. The incoming XQuery is analysed to identify
precisely what information is requested from the underlying
data source. The wrapper then generates a
source-adapted query following the query capabilities
specified for that data source. Then it
establishes a connection to the data source via HTTP
or FTP. In the case of HTTP, the
data source is queried through its web interface
using its query engine. The answer is one or several web
pages, which are parsed on the fly to extract the requested
data. In the case of FTP, the data source is
queried through its available flat files, which are also
parsed on the fly. A set of methods extracts
data from the source answers and organises them as
an instance of the XML source schema before sending it
to SB-KOM in the form of an XML document.

Figure 3 A fragment of the Yeastract schema. It is used in YeastMed as a model describing data and their organisation in the data source and
defines the structure in which results are returned by the Yeastract web service.

Figure 4 Architecture of the web services in the YeastMed system. Services receive XQueries through the different methods of the API and
transmit them to the wrapper. The output is an XML document.

YeastMed is able to reflect data provenance either by calling
the method getProvenance(), which returns information
about a source, or through the XML document returned
by the Data Services, which contains by default a description
of the queried data source. Thus, instances of the
integrated data can be annotated with the
provenance of each piece of information. In this way, the user
interface can show users the provenance of each part
of the results.
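A minimal Python sketch of provenance tracking during result combination is given below. It assumes, purely for illustration, that each Data Service answer is an XML document whose root carries a source attribute; the element names and the sample values are hypothetical, and plain dictionaries stand in for the RDF ontology instance that YeastMed actually builds.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def merge_with_provenance(xml_answers):
    """Group values by (gene, field) and keep the source of each value."""
    merged = defaultdict(list)
    for document in xml_answers:
        root = ET.fromstring(document)
        # Assumed convention: the root element names the queried source.
        source = root.get("source", "unknown")
        for gene in root.findall("gene"):
            name = gene.get("name")
            for field in gene:
                merged[(name, field.tag)].append((field.text, source))
    return dict(merged)


# Hypothetical answers from two Data Services.
answers = [
    '<result source="SGD"><gene name="SWI4">'
    "<systematicName>YER111C</systematicName></gene></result>",
    '<result source="YEASTRACT"><gene name="SWI4">'
    "<regulates>TOP1</regulates></gene></result>",
]
merged = merge_with_provenance(answers)
for key, values in merged.items():
    print(key, values)
```

Each value stays paired with the name of the source that returned it, so an interface built on top of such a structure can display the origin of every part of a result.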
YeastMed ontology 

As mentioned before, the goal of the YeastMed system
is to help scientists get information from multiple
yeast data sources by providing a single access point.
To that end, we have equipped YeastMed with a domain
ontology. The primary purpose of this ontology is to
support the user queries. Queries are phrased in terms
of the ontology and YeastMed converts them into XQuery
requests sent to the appropriate sources via the Data Services.
The YeastMed ontology has been constructed from
scratch by reconciling the different data source schemas
into a single, coherent ontology.

Briache et al. BMC Bioinformatics 2012, 13(Suppl 1):S7
http://www.biomedcentral.com/1471-2105/13/S1/S7

Figure 5 A fragment of the YeastMed ontology. It shows the semantic encapsulation of the concepts related to DNA sequences in YeastMed.
For example, the concept GeneSequence, which represents the set of gene sequences, is a child of the concept DNASequence, which represents all
types of DNA sequences. DNASequence is in turn a child of NASequence, which represents the set of nucleic acid sequences.
Other concepts shown in the figure include GenomicSequence, GenomicSequenceExcludingIntrons, IntronSequence, CodingSequence, BindingSiteSequence, PromoterSequence and ExonSequence.

The YeastMed ontology [34] ensures semantic encapsulation
of data sources by defining a concept hierarchy.
This is a classification of all the biological entities
manipulated by the system. It represents a knowledge
model that captures biological and bioinformatics
knowledge in a simple hierarchical conceptual framework
constrained by parent-child relationships (Figure
5): a child is a subset of its parent's elements; each child
inherits all of its parent's properties but has more specialised
properties of its own. Overall, the ontology concepts
can be classified into two categories: the purely
biological concepts category and the source-related concepts
category.

■ The purely biological concepts category is the union of all 
the classes modelling biological entities found in the 
integrated data sources. An example is the ChromosomalFeature 
concept, the superclass of 38 classes representing different 
types of chromosomal features (genes, CDS, introns, repeat 
regions, etc.). 

■ The source-related concepts category comprises concepts 
referring to sources. For example, the concept Source 
represents the five integrated data sources and the concept 
Entry refers to entries in data sources. This category was 
added to the ontology so that scientists using YeastMed can 
express their preferences on data sources: they can determine 
which source entry YeastMed should return when a result is 
found, rather than leaving the choice to the system. 
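
The parent-child hierarchy described above can be illustrated with a minimal Python sketch. It is not part of YeastMed: it only encodes the three concepts named in the caption of Figure 5, and the dictionary PARENT and the helper functions are hypothetical names introduced for illustration.

```python
# Hypothetical, hand-written fragment of the concept hierarchy of Figure 5.
# Each concept maps to its parent; None marks the top of this fragment.
PARENT = {
    "NASequence": None,
    "DNASequence": "NASequence",
    "GeneSequence": "DNASequence",
}


def ancestors(concept):
    """Return the chain of parents of a concept, nearest parent first."""
    chain = []
    parent = PARENT[concept]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def is_a(child, parent):
    """True if `child` is `parent` itself or one of its descendants."""
    return child == parent or parent in ancestors(child)


print(ancestors("GeneSequence"))
print(is_a("GeneSequence", "NASequence"))
```

Because a child is a subset of its parent, any query posed on NASequence would in principle also cover GeneSequence individuals, which is what is_a expresses.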

To convey additional semantic information about the 
concepts, the ontology defines two types of properties. 
The first one is defined by a set of object properties that 
model the relationships that can hold between two indivi- 
duals belonging to one or two different classes of the 
ontology. The second type concerns datatype properties: these 
are relationships linking an individual to a literal value. 

To further illustrate the role of properties in conveying 
semantics to the YeastMed ontology, we detail a real-world 
example (Figure 6). SWI4 [35], which has the systematic name 
YER111c, is a gene coding for a DNA-binding component of the 
SBF complex (Swi4-Swi6), a transcriptional activator that, in 
concert with MBF (Mbp1-Swi6), regulates late G1-specific 
transcription of targets including cyclins and genes required 
for DNA synthesis and repair. One such target is 
Topoisomerase I [36] (which has the standard name TOP1). From 
this we can make the following assertions: 

■ SWI4 and TOP1 are two Genes having the systematic names 
YER111c and YOL006c; 
Figure 6 A schematic representation of the example sketched in this section. It shows four biological concepts (ellipses) linked by four 
object properties (red arrows) and two parent-child relationships (blue arrows), and two datatype properties (green arrows) linking two concepts 
to values of type String (rectangles). 



■ SWI4 and TOP1 code respectively for a Transcription Factor 
and an Enzyme; 

■ SWI4 regulates the transcription activity of TOP1; 

■ Both SWI4 and TOP1 code for proteins (having respectively 
the same standard names as their Genes). 

These assertions let one define: 

■ Four concepts: Gene, Protein, TranscriptionFactor, and 
Enzyme; 

■ Four object properties: codesFor and its inverse property 
codedBy linking Gene to Protein, in addition to the property 
regulates and its inverse regulatedBy linking 
TranscriptionFactor to Gene; 

■ Two datatype properties: hasStandardName and 
hasSystematicName linking TranscriptionFactor and Enzyme to 
literal values of type String (SWI4 and TOP1 for the former 
and YER111c and YOL006c for the latter); 

■ Enzyme and TranscriptionFactor as child concepts of 
Protein. 
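
As an illustration only, the assertions of this example can be written down as subject-property-object triples. The following Python sketch assumes hypothetical individual identifiers (SWI4_gene, TOP1_gene, Swi4p, Top1p) and uses plain tuples rather than an OWL library; it shows how inverse properties and the parent-child relationships could be exploited.

```python
# Parent-child relationships and inverse properties taken from the example.
SUBCLASS_OF = {"TranscriptionFactor": "Protein", "Enzyme": "Protein"}
INVERSE = {"codesFor": "codedBy", "regulates": "regulatedBy"}

# Individual identifiers below are hypothetical labels for this sketch.
TYPES = {
    "SWI4_gene": "Gene",
    "TOP1_gene": "Gene",
    "Swi4p": "TranscriptionFactor",
    "Top1p": "Enzyme",
}

TRIPLES = {
    ("SWI4_gene", "codesFor", "Swi4p"),
    ("TOP1_gene", "codesFor", "Top1p"),
    ("Swi4p", "regulates", "TOP1_gene"),
    ("Swi4p", "hasStandardName", "SWI4"),
    ("Top1p", "hasStandardName", "TOP1"),
    ("Swi4p", "hasSystematicName", "YER111c"),
    ("Top1p", "hasSystematicName", "YOL006c"),
}


def with_inverses(triples):
    """Add the inverse statement for every object property that has one."""
    closed = set(triples)
    for subject, prop, obj in triples:
        if prop in INVERSE:
            closed.add((obj, INVERSE[prop], subject))
    return closed


def is_instance_of(individual, concept):
    """Follow parent links so a TranscriptionFactor also counts as a Protein."""
    current = TYPES[individual]
    while current is not None:
        if current == concept:
            return True
        current = SUBCLASS_OF.get(current)
    return False


closed = with_inverses(TRIPLES)
print(("TOP1_gene", "regulatedBy", "Swi4p") in closed)
print(is_instance_of("Swi4p", "Protein"))
```

Datatype properties have no inverse here because their values are literals, not individuals.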

In YeastMed, we have chosen OWL [37] as the standard ontology 
language to represent the ontology. Like RDF [38], OWL takes 
advantage of the syntactic universality of XML. Based on the 
RDF/XML syntax, OWL provides a way to write web ontologies. It 
differs from the RDF/RDFS pair in that it is purely an ontology 
language: while RDF and RDFS give the user the ability to 
describe classes (with constructors) and properties, OWL 
additionally incorporates means of comparing properties and 
classes: identity, equivalence, complement, cardinality, 
symmetry, transitivity, disjunction, etc. Thus, OWL offers 
machines a greater capacity to interpret web content than RDF 
and RDFS [39], with a wider vocabulary and a real formal 
semantics. More precisely, we restricted ourselves to OWL-Lite 
(a sub-language of OWL) because we intended from the beginning 
to equip YeastMed with a simple domain ontology showing a 
simple concept hierarchy and simple constraints. 
Mappings 

Having a domain ontology facilitates the formulation of 
queries to the system. The users simply pose queries in 
terms of the ontology rather than directly in terms of 
the Source Schemas. Although this is very practical and 
effective in terms of the system transparency to the 
user, it brings the problem of mapping the query in the 
mediated schema to one or more queries in the schemas 
of the data sources. In YeastMed, this problem is solved 
using the functionality of SB-KOM. So in addition to 
modelling the ontology and the sources, we needed to 
establish associations between the concepts in the ontol- 
ogy and the appropriate elements representing the infor- 
mation in the sources. These associations are 
materialised in YeastMed by the mapping rules. 

SB-KOM is designed to decompose queries using mappings that 
follow the GAV (Global-as-View) approach. That means each 
concept (and, in our case, each property) in the ontology is a 
view defined in terms of the source schemas' elements. This 
view specifies how to obtain instances of the mediated schema 
elements from sources. In this context, the mapping rules we 
have used are defined as pairs (P, Q), where P is one or two 
path expressions on a source schema expressed in XPath, and Q 
is a conjunctive query expressed in terms of the ontology. 
Three kinds of mappings have been defined: 

■ Class Mapping: it maps ontology classes to source schemas. 
It has the following form: 

    XPath-Element-Location, Ontology-Class-Name, correspondence-index 

where XPath-Element-Location is the location of an element in 
the source schema, expressed in XPath; Ontology-Class-Name is 
the name of the corresponding class in the ontology; and 
correspondence-index is an integer value that indicates the 
correctness of the mapping instance. In YeastMed, this index is 
always 100 since all the mappings are done manually and not 
automatically. An example which maps the Protein class to the 
SGD schema is as follows: 

    Result/Entries/Entry/Protein, Protein, 100 

■ Datatype Property Mapping: it maps ontology datatype 
properties to source schemas. It has the following form: 

    XPath-Domain-Location; XPath-value-Location, Ontology-Domain-Name; Property-Name, correspondence-index 

XPath-Domain-Location is the path to the element in the source 
schema which is mapped to the domain of the datatype property; 
XPath-value-Location is the path to the element where the 
property takes the value of its range; and Ontology-Domain-Name 
and Property-Name are respectively the domain and the name of 
the property. The following example concerns the datatype 
property hasName: 

    Result/Entries/Entry/Protein; Result/Entries/Entry/Protein/SysName, TranscriptionFactor; hasName, 100 

■ Object Property Mapping: it maps ontology object properties 
to source schemas. It has the following form: 

    XPath-Domain-Location; XPath-Range-Location, Ontology-Domain-Name; Ontology-Range-Name; Property-Name, correspondence-index 

XPath-Range-Location is the path to the element in the source 
schema which is mapped to the range of the object property. 
Ontology-Range-Name is the range name of the property. The 
following example shows how the object property hasBibRef is 
mapped to the source schema: 

    Result/Entries/Entry/Protein; Result/Entries/Entry/Literature, Protein; BibRef; hasBibRef, 100 
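
To make the three rule forms concrete, the following Python sketch parses rules written in the textual form given above. It is an illustration under stated assumptions, not the format actually stored in the Semantic Directory: the MappingRule type and parse_rule function are hypothetical, and the parser assumes that XPath locations contain neither commas nor semicolons.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappingRule:
    kind: str     # "class", "datatype" or "object"
    paths: tuple  # XPath locations in the source schema
    terms: tuple  # ontology names: class, or domain(/range) and property
    index: int    # correspondence index (always 100 in YeastMed)


# The rule kind follows from how many paths and ontology terms are given.
KINDS = {(1, 1): "class", (2, 2): "datatype", (2, 3): "object"}


def parse_rule(text):
    """Parse one rule of the form 'paths, ontology terms, index'."""
    path_part, term_part, index_part = (f.strip() for f in text.split(","))
    paths = tuple(p.strip() for p in path_part.split(";"))
    terms = tuple(t.strip() for t in term_part.split(";"))
    kind = KINDS[(len(paths), len(terms))]
    return MappingRule(kind, paths, terms, int(index_part))


rules = [
    parse_rule("Result/Entries/Entry/Protein, Protein, 100"),
    parse_rule("Result/Entries/Entry/Protein; "
               "Result/Entries/Entry/Protein/SysName, "
               "TranscriptionFactor; hasName, 100"),
    parse_rule("Result/Entries/Entry/Protein; "
               "Result/Entries/Entry/Literature, "
               "Protein; BibRef; hasBibRef, 100"),
]
for rule in rules:
    print(rule.kind, rule.terms[-1])
```

In every rule the last ontology term is the name being mapped (a class or a property), which is why the loop prints terms[-1].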
SB-KOM 

YeastMed relies on SB-KOM [15] to perform query 
transformations at execution time. KOMF is a generic 
infrastructure to register and manage ontologies, their 
relationships and also information relating to the resources. 
This infrastructure is based on a resource directory, called 
the Semantic Directory [40], with information about web 
resource semantics. KOMF has been successfully instantiated in 
the context of molecular biology for integrating biological 
data sources [41-43]. The SB-KOM mediator is composed of three 
main components: the Controller, the Query Planner and the 
Evaluator/Integrator. 

The Controller component receives requests coming from the 
YeastMed web interface and evaluates them to obtain a result. 
The Controller creates different threads for different user 
requests and acts as the middleware between the mediator 
components. Queries are expressed as conjunctive predicates 
[44], with three main types of predicate: classes of the 
YeastMed ontology, which is registered in the Semantic 
Directory; datatype properties, which link individuals to data 
values; and object properties, which link individuals to 
individuals. The results of these queries are instances of the 
YeastMed ontology in which the query was expressed. 

The Query Planner component is one of the most fundamental 
parts of the mediator: it elaborates one or several query plans 
to solve the query over the different data sources. Plans 
generated by this component specify the data sources from which 
the information can be retrieved and the order in which they 
must be accessed. The evaluation of the query depends on the 
query plans themselves. 

Depending on the predicates of the (conjunctive) query, 
different types of mapping are involved in the Semantic 
Directory. Classes are connected to the XPath of one or several 
XML Schema resource elements. Datatype properties are connected 
to two XPath expressions: the first corresponds to the class 
and the second to the property. Object properties are related 
to the XPath expressions of the classes involved in the 
property. 

The Query Planner follows a simple algorithm that receives as 
input a conjunctive query expressed in terms of the YeastMed 
ontology (a conjunction of concepts and properties) and returns 
a set of possible plan trees. The algorithm steps are 
enumerated below (for a use case see the following section): 

1. Get all the query predicates (concepts and properties) and 
distribute them in two groups based on the number of 
arguments: G_1 will contain predicates having one argument 
(concepts) and G_2 will contain predicates having two arguments 
(properties). 

2. Construct a set G_s of combinations between the two groups 
based on common arguments, add all the elements of G_1 and G_2 
to it and eliminate the repeated ones. 

3. Eliminate from G_s the elements that do not have a 
representation in the mapping rules registered in the Semantic 
Directory. 

4. For each instantiated variable in the predicate arguments, 
elaborate a plan tree: 

a. The instantiated variable constructs a root node. 

b. The elements that contain a predicate specifying a value 
for the instantiated variable and the elements that contain 
only the instantiated variable (without other variables) are 
passed to the current node and eliminated from G_s. 

c. The elements that contain, in addition to the instantiated 
variable, another variable constitute the edges leaving the 
current node to new nodes and are eliminated from G_s. The 
newly created nodes are represented by the other variables, 
which become the instantiated variables. 

d. If there are still elements in G_s, continue from step 4.b 
for each new instantiated variable. 
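
The steps above can be sketched in Python as follows. This is a simplified reading of the algorithm, not the SB-KOM implementation: predicates are modelled as (name, arguments) pairs, constants are assumed to be written in double quotes, the mapping check of step 3 is a caller-supplied function (has_mapping), and duplicate edges and plan optimisation are not handled. All function names are hypothetical.

```python
from collections import deque


def is_const(arg):
    # Assumption of this sketch: constants are written with double quotes.
    return arg.startswith('"')


def variables(element):
    """Variables occurring in an element (a tuple of predicates)."""
    return {a for _, args in element for a in args if not is_const(a)}


def instantiated_variables(query):
    """Variables for which some two-argument predicate fixes a value."""
    return [args[0] for _, args in query
            if len(args) == 2 and is_const(args[1])]


def build_elements(query, has_mapping):
    """Steps 1 to 3: split by arity, combine on shared arguments, filter."""
    concepts = [p for p in query if len(p[1]) == 1]
    properties = [p for p in query if len(p[1]) == 2]
    combos = [(c, p) for c in concepts for p in properties if c[1][0] in p[1]]
    singles = [(p,) for p in concepts + properties]
    unique = list(dict.fromkeys(combos + singles))  # drop repeats, keep order
    return [e for e in unique if has_mapping(e)]


def plan_tree(root, elements):
    """Step 4: grow one tree from an instantiated variable."""
    remaining = list(elements)
    tree = {}
    queue = deque([root])
    while queue:
        var = queue.popleft()
        node = tree.setdefault(var, {"local": [], "edges": []})
        for element in list(remaining):
            found = variables(element)
            if var not in found:
                continue
            remaining.remove(element)
            others = found - {var}
            if not others:                      # step 4.b
                node["local"].append(element)
            for other in sorted(others):        # step 4.c
                node["edges"].append((element, other))
                if other not in tree and other not in queue:
                    queue.append(other)
    return tree, remaining


query = [
    ("Protein", ("P",)),
    ("hasDescription", ("P", '"DNA Topoisomerase III"')),
    ("regulatedBy", ("P", "TF")),
    ("TranscriptionFactor", ("TF",)),
]
elements = build_elements(query, has_mapping=lambda element: True)
tree, leftover = plan_tree(instantiated_variables(query)[0], elements)
print(list(tree), leftover)
```

In this small example every element is assumed to have a mapping, so the tree is rooted at P (fixed by hasDescription) with edges leading to TF.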

The Evaluator/Integrator is the third component of the SB-KOM 
mediator. It analyses the query plan (QP) and performs the 
corresponding calls to the Data Services involved in the 
sub-queries (SQ1, ..., SQn) of the query plan. To answer a 
YeastMed query, this component first executes the Data Services 
in the order specified by the query plan. Then, it obtains the 
instances from the Data Service results. These instances are 
not interconnected because they have been produced by different 
Data Services. To retrieve a set of interrelated instances, 
relationships must be established between them. This is 
achieved through the object properties defined in the ontology, 
which are used as relationships between services in the query 
plan. Finally, these interrelated instances are filtered in 
order to eliminate the information not required. 

Use case 

In this section, we show how a user query is solved by 
YeastMed, and how its different components take part in this 
process. Let us take the case of a biologist who is using 
YeastMed to find information about two kinds of proteins. The 
first is DNA Topoisomerase III, and the second comprises the 
transcription factors regulating the expression of the first. 
The biologist is interested in the phosphorylation sites found 
in the sequences of the transcription factors of DNA 
Topoisomerase III, especially the one (or ones, if they exist) 
whose gene is located on chromosome 16. In addition, the 
biologist also aims to get all the literature on DNA 
Topoisomerase III. As stated previously, YeastMed provides a 
web interface that allows biologists to express this kind of 
request in terms of the ontology. The user can formulate the 
request in the YeastMed interface by selecting field items as 
follows: 

"Protein", "having Description" , "DNA Topoisomerase 
III"; 

"Protein", "having Bibliographic Reference", "Biblio- 
graphic Reference"; 

"Protein", "Regulated By", "Transcription Factor"; 

"Transcription Factor", "Belongs To", "Chromosome"; 

"Chromosome", "having Name", "16"; 

"Transcription Factor", "having Phosphorylation Site", 
"Phosphorylation Site". 

To specify to the system what to return, the user should tick 
the boxes above the fields where "Bibliographic Reference" and 
"Phosphorylation Site" were chosen before submitting the query. 

The fragment of semantics that is directly involved in 
formulating this query is shown in Figure 7. From this 
fragment, a conjunctive query is generated automatically: 

    Ans(BR, Ph) := Protein(P), 
        hasDescription(P, "DNA Topoisomerase III"), BibRef(BR), 
        hasBibRef(P, BR), hasSystematicName(P, SN), 
        regulatedBy(P, TF), hasName(TF, Nt), 
        TranscriptionFactor(TF), Chromosome(C), 
        hasName(C, "16"), BelongsTo(TF, C), PhosphoSite(Ph), 
        hasPhosphoSite(TF, Ph) 

This conjunctive query includes as predicates five ontology 
classes (Protein, BibRef, TranscriptionFactor, Chromosome and 
PhosphoSite), three datatype properties (hasDescription, 
hasSystematicName and hasName) and four object properties 
(hasBibRef, regulatedBy, belongsTo and hasPhosphoSite). This 
query will return instances of PhosphoSite and BibRef that 
satisfy its constraints. 
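
The counts given above can be checked mechanically. The Python sketch below parses the body of the conjunctive query with a regular expression and sorts the predicate names into classes, datatype properties and object properties. The parser is an assumption made for illustration (it expects no commas or parentheses inside quoted values), and the set DATATYPE_PROPERTIES is supplied by hand from the text, since the syntax alone does not distinguish the two kinds of property.

```python
import re

QUERY_BODY = (
    'Protein(P), hasDescription(P, "DNA Topoisomerase III"), BibRef(BR), '
    'hasBibRef(P, BR), hasSystematicName(P, SN), regulatedBy(P, TF), '
    'hasName(TF, Nt), TranscriptionFactor(TF), Chromosome(C), '
    'hasName(C, "16"), BelongsTo(TF, C), PhosphoSite(Ph), '
    'hasPhosphoSite(TF, Ph)'
)

# A predicate is a name followed by a parenthesised argument list.
PREDICATE = re.compile(r'(\w+)\(([^()]*)\)')

# Taken from the text; not derivable from the query syntax.
DATATYPE_PROPERTIES = {"hasDescription", "hasSystematicName", "hasName"}


def parse_body(body):
    """Return the predicates of a query body as (name, arguments) pairs."""
    predicates = []
    for name, arg_text in PREDICATE.findall(body):
        args = tuple(a.strip() for a in arg_text.split(","))
        predicates.append((name, args))
    return predicates


def classify(predicates):
    """Group predicate names into classes, datatype and object properties."""
    classes, datatype, objects = set(), set(), set()
    for name, args in predicates:
        if len(args) == 1:
            classes.add(name)
        elif name in DATATYPE_PROPERTIES:
            datatype.add(name)
        else:
            objects.add(name)
    return classes, datatype, objects


predicates = parse_body(QUERY_BODY)
classes, datatype, objects = classify(predicates)
print(len(predicates), len(classes), len(datatype), len(objects))
```

The query contains thirteen predicate occurrences but only twelve distinct names, because hasName is used twice (once for the transcription factor and once for the chromosome).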

As a subsequent step, the conjunctive query is sent to SB-KOM, 
where it is received by the Controller, which passes it to the 
Query Planner. Based on the query predicates and the mappings 
in the Semantic Directory, this component's algorithm generates 
a set of sub-queries and a plan to execute them. The predicates 
of the conjunctive query are divided into two sets: one 
containing predicates with a single argument and another 
containing predicates with more than one argument. The 
predicates from the two sets 
Figure 7 The fragment of the ontology invoked to formulate the query example. Classes are shown in green and Properties in blue. The 
mappings between the ontology and the source schemas are present above the ontology element (in red). 



which have common arguments are then grouped together into 
groups represented by the combination of two or more 
predicates. The groups that are not represented in the Semantic 
Directory mappings are discarded. The remainder is added to the 
first set, allowing a group to be present only once. Table 1 
lists all resulting groups. 

From this set, the planner tries to construct potential trees 
describing the execution order. It selects groups with 
instantiated variables in order to set a root for a tree. The 
order of the plan execution depends on the instantiated 
variables: the group containing an instantiated variable is 
executed first, then the groups that are related to that 
variable, and so on until all the groups are executed. In our 
case, G2 and G8 are selected. G8 cannot serve as a root because 
no other group depends on its instantiated variable, which 
would leave the other groups unexecuted. This is not the case 
for G2, which serves as the root of the tree shown in Figure 8. 
It is the first to be executed and returns the protein whose 
description is "DNA Topoisomerase III". Then G9 and G10 are 
executed in parallel because they depend on the instantiated 
variable of G2. From these simultaneous executions, the 
algorithm will 



Table 1 The groups used to form the plan tree. 

Group   Query                                                    Mapping source 
G1      Protein(P), hasBibRef(P,BR)                              SGD 
G2      Protein(P), hasDescription(P,"DNA Topoisomerase III")    SGD 
G3      Protein(P), hasSystematicName(P,SN)                      Yeastract 
G4      Protein(P), RegulatedBy(P,TF)                            Yeastract 
G5      TranscriptionFactor(TF), hasName(TF,Nt)                  Yeastract 
G6      TranscriptionFactor(TF), belongsTo(TF,C)                 Yeastract 
G7      TranscriptionFactor(TF), hasPhosphorylationSite(TF,Ph)   PhosphoGrid 
G8      Chromosome(C), hasName(C,"16")                           Yeastract 
G9      regulatedBy(P,TF)                                        Yeastract 
G10     hasBibRef(P,BR)                                          SGD 
G11     belongsTo(TF,C)                                          Yeastract 
G12     hasPhosphoSite(TF,Ph)                                    PhosphoGrid 
G13     Protein(P)                                               SGD; Yeastract; PhosphoGrid 
G14     TranscriptionFactor(TF)                                  Yeastract; PhosphoGrid 
G15     BibRef(BR)                                               SGD 
G16     Chromosome(C)                                            Yeastract 
G17     PhosphoSite(Ph)                                          PhosphoGrid 

For each group the mapping source is given. 
Figure 8 The plan tree generated from the conjunctive query. The plan tree is a binary tree where nodes are represented by variables of 
the predicate arguments in the conjunctive query and the edges are predicates containing the two variables of the nodes they are linking. 



determine all the objects that are related to Protein by means 
of the relationships regulatedBy and hasBibRef. Once those 
objects are obtained, it checks whether they satisfy G14 and 
G15, that is, whether the objects obtained from G9 and G10 are 
of the types TranscriptionFactor and BibRef respectively. Based 
on the result of G9, groups G11 and G12 are executed, but not 
simultaneously. SB-KOM has a plan optimisation module that 
might change the order of the initial plan execution, as is the 
case here: since G8 has an instantiated variable (value "16") 
and is related to G14 via G11, G11 is executed before G12, and 
its result is used by the next group to be executed. The arcs 
of the plan trees generated by the planner represent object 
properties, while the nodes are ontology concepts or instances 
of these. Each node and arc contains all the information the 
Evaluator/Integrator needs to execute sub-queries, namely the 
XQuery (elaborated from the mapping) corresponding to the 
sub-query of the node or the arc, and the names and URLs of the 
Data Services of interest. An example is shown in Figure 9. 
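
As a sketch of how an XQuery could be elaborated from a datatype property mapping, the Python function below builds a FLWOR expression from the two XPath locations of the mapping and a constant value. The function name and its checks are hypothetical; only the shape of the resulting query follows the example given for node P.

```python
def xquery_for_datatype(domain_path, value_path, value):
    """Build a FLWOR expression selecting domain elements by property value."""
    prefix = domain_path + "/"
    if not value_path.startswith(prefix):
        raise ValueError("value path must lie below the domain path")
    relative = value_path[len(prefix):]
    # In an XQuery string literal, '&' is written as an entity reference
    # and a double quote is escaped by doubling it.
    literal = value.replace("&", "&amp;").replace('"', '""')
    return (
        f"for $d in {domain_path}\n"
        f'where $d/{relative} eq "{literal}"\n'
        f"return $d"
    )


print(xquery_for_datatype(
    "/Result/Entries/Entry/Protein",
    "/Result/Entries/Entry/Protein/Description",
    "DNA Topoisomerase III",
))
```

For the hasDescription mapping to the SGD schema, this yields a query that iterates over /Result/Entries/Entry/Protein, keeps the elements whose Description equals "DNA Topoisomerase III", and returns them.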

The YeastMed Data Services are executed by the 
Evaluator/Integrator following the plan generated by the 
Planner, after optimisation. In our case, the SGD Data Service 
receives the first sub-query, because the datatype property 
hasDescription is mapped to the SGD schema. TOP3 is returned as 
the answer to this sub-query and is then used by the sub-query 
RegulatedBy to find instances of TranscriptionFactor. The 
Yeastract Data Service is invoked this time because the 
property is mapped to the Yeastract schema. Three instances of 
the type TranscriptionFactor are returned: Fhl1p, Hsf1p and 
Swi4p. For each of these instances, the Yeastract Data Service 
is called again. This time it receives the sub-query 
represented by the property belongsTo with both arguments 
instantiated: the first is one of the three instances returned 
by the previous query, and the second is the name of chromosome 
16. This sub-query checks whether the transcription factor has 
its coding gene on chromosome 16. Only the instance Fhl1p is 
retained. Finally, the sub-query hasPhosphoSite is executed on 
the PhosphoGrid Data Service, which returns all the PhosphoSite 
instances of the transcription factor Fhl1p. At each execution, 
the Evaluator/Integrator receives results in XML format from 
the target Data Services. 
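
The execution order described above can be mimicked with in-memory stand-ins for the Data Services. The Python sketch below is illustrative only: the real services receive XQuery sub-queries and return XML, whereas here each service is a dictionary, the phosphorylation site names are placeholders, and the bibliographic branch of the plan is omitted.

```python
# Mock stand-ins for the Data Services (hypothetical data structures).
SGD_DESCRIPTIONS = {"TOP3": "DNA Topoisomerase III"}
YEASTRACT_REGULATORS = {"TOP3": ["Fhl1p", "Hsf1p", "Swi4p"]}
# Chromosome of the coding gene; illustrative values for this sketch.
YEASTRACT_CHROMOSOME = {"Fhl1p": "16", "Hsf1p": "7", "Swi4p": "5"}
# Placeholder site names, not real PhosphoGrid records.
PHOSPHOGRID_SITES = {"Fhl1p": ["site-A", "site-B"]}


def answer(description, chromosome):
    """Run the sub-queries serially, each one using the previous result."""
    result = {}
    # 1. hasDescription, answered by the SGD service.
    proteins = [p for p, d in SGD_DESCRIPTIONS.items() if d == description]
    for protein in proteins:
        # 2. regulatedBy, answered by the Yeastract service.
        factors = YEASTRACT_REGULATORS.get(protein, [])
        # 3. belongsTo, checked once per transcription factor.
        kept = [tf for tf in factors
                if YEASTRACT_CHROMOSOME.get(tf) == chromosome]
        # 4. hasPhosphoSite, answered by the PhosphoGrid service.
        result[protein] = {tf: PHOSPHOGRID_SITES.get(tf, []) for tf in kept}
    return result


print(answer("DNA Topoisomerase III", "16"))
```

The serial dependency between the calls is visible in the code: the Yeastract lookups cannot start before the SGD sub-query has produced TOP3, and the PhosphoGrid lookup waits for the chromosome filter.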

These results are instances of the XML schemas of the 
underlying sources. Based on the mapping between the 
elements of the source schemas and the elements of the 
ontology, these XML schema instances are translated 
Ontology: YeastMedOntology, http://khaos.uma.es:8083/ontologies/YeastMedOntology 
Resource: SGD, http://khaos.uma.es:8088/services/SGDService 
Ontology terms: Protein; hasDescription 
Resource elements: /Result/Entries/Entry/Protein; /Result/Entries/Entry/Protein/Description 
XQuery: 
    for $d in /Result/Entries/Entry/Protein 
    where $d/Description eq "DNA Topoisomerase III" 
    return $d 

Figure 9 The information presented by the node P in the plan tree. The node P contains the location of the YeastMed ontology and the 
web service to call, in addition to the mapping resources and the XQuery to send to the web service. 



into ontology instances, which are not interconnected because 
they have been produced by different Data Services. To 
associate them, the Evaluator/Integrator uses just the 
instances of the domain and range classes of the object 
properties. The final result is an ontology instance that 
includes all the data extracted from the queried data sources, 
that is, all the BibRef instances of the protein TOP3 and all 
the PhosphoSite instances of the transcription factor Fhl1p. 

Results 

We have conducted a usability assessment in order to grade how 
well biologists can learn and use YeastMed to achieve their 
goals and how satisfied they are with the system. We have also 
conducted a performance study to reveal how run times change as 
the number of data sources involved in a query increases. In 
this section we present the results obtained from these two 
studies. 

System usability 

A variety of methods have been reported in the literature for 
assessing the perceived usability of interactive systems, 
notably QUIS [45], SUS [46], CSUQ [47] and Microsoft's Product 
Reaction Cards [48]. Tullis and Stetson [49] reported a study 
that compared these methods and showed that the accuracy of the 
analysis increases as the number of participants gets larger 
(for a sample of 6 to 14) and that the accuracy of SUS 
increases more quickly than that of the others. For that 
reason, we used the SUS method in our study. The SUS 
questionnaire consists of 10 items for which participants rate 
their level of agreement. Odd-numbered items are positively 
worded and even-numbered items are negatively worded. A 5-point 
scale of agreement numbered from 1 (anchored with "Strongly 
disagree") to 5 (anchored with "Strongly agree") is used for 
each. Each item's score contribution ranges from 0 to 4. For 
odd-numbered items, the score contribution is the scale 
position minus 1. For even-numbered items, the score 
contribution is 5 minus the scale position. To get the overall 
SUS score, which is the indicator of usability, the sum of the 
item score contributions is multiplied by 2.5. SUS scores range 
from 0 to 100, with 100 representing a perfect score. 
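
The scoring rule just described can be written as a short Python function. This is a minimal sketch of the standard SUS computation as stated in the text; the function name and the input format (a list of ten scale positions, item 1 first) are assumptions made for illustration.

```python
def sus_score(responses):
    """Compute the SUS score from ten scale positions between 1 and 5."""
    if len(responses) != 10 or any(r not in range(1, 6) for r in responses):
        raise ValueError("expected ten responses between 1 and 5")
    total = 0
    for item, position in enumerate(responses, start=1):
        if item % 2 == 1:
            total += position - 1   # odd items: scale position minus 1
        else:
            total += 5 - position   # even items: 5 minus scale position
    return total * 2.5


# A participant who fully agrees with every positively worded item and
# fully disagrees with every negatively worded one reaches the maximum.
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))
# Answering 3 everywhere gives 2 points per item: 20 * 2.5.
print(sus_score([3] * 10))
```

The first call yields 100.0 and the second 50.0, which shows that a neutral response pattern sits at the midpoint of the scale.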

The usability study we conducted had two objectives: (1) 
obtaining a general indicator of the usability of YeastMed, and 
(2) assessing how the system usability evolves with the level 
of familiarity with biological databases. In our study, this 
level is represented by how frequently the participants use 
biological databases. These objectives let us grade (1) how 
well biologists, in general, can learn and use the system and 
(2) how well we have succeeded in providing an easy-to-use 
system for biologists who are familiar with the HTML forms of 
biological databases. 

There were a total of 39 participants. Each one tested YeastMed 
before completing the SUS questionnaire. All the participants 
are biologists, spread over 5 groups with different levels of 
familiarity with biological databases. These groups contained 
between 7 and 9 participants and are named after the 
participants' frequency of using biological databases, i.e. 
Never, Rarely, Sometimes, Usually and Always. For each 
participant we calculated the individual SUS score, and then 
the mean score for each group was determined. As shown in 
Figure 10, the usability of YeastMed increases with familiarity 
with biological databases: the mean SUS score rises from 60.71 
for biologists who never use biological databases to 78.75 for 
biologists who always use them, with an overall SUS score of 
71.54. With these scores, we can say that YeastMed, with its 
simple HTML form-based interface, is easy to use for biologists 
who are familiar with biological database interfaces, and 
somewhat less usable for biologists with less familiarity. 

System performance 

To illustrate the performance of YeastMed, we present, 
in this section, the result of a study conducted on the 
run times of the three main stages of the YeastMed 
query processing: Planification, Execution and 
Figure 10 SUS scores of YeastMed as a function of the frequency of using biological databases. The graph shows that the 
mean SUS score increases progressively as biologists' frequency of using biological databases increases. 



Integration. The study concerned 50 queries distributed over 5 
groups according to the number of data sources participating in 
the query answer (from 1 to 5). All queries were run on a 
dual-processor 2.33 GHz Pentium 4 machine with 4 GB of memory. 
The objective of this study is not to provide a thorough 
performance analysis, but simply to show how run times change 
as the number of data sources involved increases. Each query 
was executed three times, and the mean run time of each of the 
three stages was then calculated. In the multi-source queries, 
each data source involved is called exactly once per query, to 
keep the study uniform. 

Figure 11 illustrates the results obtained. It shows that the 
Planification time changes little when the number of sources 
involved increases: it rises from 1.149 seconds for queries 
involving one source to 1.252 seconds for queries that call 5 
sources. In contrast, the Execution time increases with the 
number of sources involved. This was expected because the 
execution of sub-queries in YeastMed makes use of a set of web 
services which are called serially rather than simultaneously, 
since the call of a web service might depend on the result of 
another. The Integration run time also increases, but only 
slightly compared to the Execution run time: it rises from 
1.149 seconds for single-source queries to 5.589 seconds for 
queries involving 5 data sources. In YeastMed, the Integration 
stage is invoked even if just one source is involved. This is 
because, in addition to the integration process, it performs 
the transformation of the XML result returned by the web 
services into an RDF instance of the YeastMed ontology. 



Discussion 

Dynamic integration is a very important issue for traditional 
mediator-based systems. They are usually developed as 
monolithic systems, and their wrapper-based architecture 
involves a high degree of coupling among the system components. 
They usually do not provide scalable and reusable solutions. 
Through the modular design and the decoupling of all the 
components of YeastMed, we have sought to break out of the 
traditional mediation architecture and provide a flexible 
platform for integrating Yeast data sources. The modular 
structure of the system reduces its maintenance costs. The 
system can easily be extended to cover other sources without 
being rebuilt from scratch. The components of a new source are 
built independently and then integrated into the system: a web 
service wrapping that source is published, semantic views on 
the source are added to the ontology, and its mapping rules are 
published in the Semantic Directory. The rest of the system 
components are not touched. Moreover, because the system adopts 
a mediation approach and keeps no local copy of the data, it 
avoids the update problem when the data in a source change. 
When a change affects the structure of the flat files or the 
HTML pages from which YeastMed extracts data, the system needs 
to reflect it, but only in the components of the modified 
source: the source schema, the mapping rules involving that 
source, the source-related entities in the ontology and the web 
service of the source. The components of the other sources are 
not modified and the system is not rebuilt from scratch. 

Relying on Data Services rather than classical wrappers makes 
it possible for other mediators, or any other data-accessing 
application, to reuse them. This is the case of the quick 
search service we offer on the YeastMed web site. This service 
is an added value of the system architecture. It uses the Data 
Services independently to look for entries and to access the 
integrated data sources directly, without passing through the 
system mediator. 

While some mediator-based systems require a specific query 
language or propose a complicated graphical user interface 
[25,27,29], the YeastMed mediator receives conjunctive queries 
expressed in terms of the ontology. Although we consider that 
such queries are not very complicated for biologists to write, 
we have provided a simple interface where requests are 
expressed using natural language through simple forms. All the 
required translations to conjunctive queries are hidden from 
users. 

It is known that biologists have their own preferences 
toward databases [50]. In YeastMed we take this into 
account by giving users the possibility to specify from 
which database they prefer to get answers. The 
YeastMed ontology includes some source-related con- 
cepts which permit the user to express preferences on 
data sources. For example, a user can specify SGD as a 
source from which to get entries by selecting SGDEntry 
in the query form. Specifying a data source does not 
mean reducing the constraints to be only applied on the 
data of that source. Users can specify a source from 
which to get entries and apply constraints on related 
data from other sources. For example, a user can ask for
entries from SGD describing a chromosomal feature
regulated by a transcription factor having the standard
name Rtg1. This is translated into the following
conjunctive query:

Ans(E) := SGDEntry(E), describes(E, F),
ChromosomalFeature(F), RegulatedBy(F, R),
TranscriptionFactor(R), hasName(R, "Rtg1");

In this conjunctive query, SGD entries are requested,
but all the constraints apply to data residing in
Yeastract (data related to transcription factors). If
SGDEntry had not been specified, the result entries would
be returned by default from Yeastract. YeastMed is able
to find the equivalent of such entries (if any) in SGD.
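
To make the semantics of such a query concrete, the following minimal sketch evaluates the example query against a handful of in-memory facts. It is only an illustration of how a conjunctive query is satisfied, not the way SB-KOM rewrites queries or how YeastMed reaches its sources; the identifiers sgd_entry_1, feature_1, tf_1 and the like are hypothetical, as are the helper names match and evaluate.

```python
# Toy evaluation of a conjunctive query over in-memory facts.
# All identifiers and facts below are hypothetical illustration data.

def is_var(term):
    """Treat single uppercase letters (E, F, R) as query variables."""
    return len(term) == 1 and term.isupper()

def match(atom, fact, binding):
    """Extend `binding` so that `atom` matches `fact`, or return None."""
    pred, args = atom
    fact_pred, fact_args = fact
    if pred != fact_pred or len(args) != len(fact_args):
        return None
    extended = dict(binding)
    for term, value in zip(args, fact_args):
        if is_var(term):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            # A constant must be equal to the fact's value.
            return None
    return extended

def evaluate(head_vars, body, facts):
    """Return the set of head tuples satisfying every atom in `body`."""
    bindings = [{}]
    for atom in body:
        next_bindings = []
        for binding in bindings:
            for fact in facts:
                extended = match(atom, fact, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return {tuple(b[v] for v in head_vars) for b in bindings}

FACTS = [
    ("SGDEntry", ("sgd_entry_1",)),
    ("SGDEntry", ("sgd_entry_2",)),
    ("describes", ("sgd_entry_1", "feature_1")),
    ("describes", ("sgd_entry_2", "feature_2")),
    ("ChromosomalFeature", ("feature_1",)),
    ("ChromosomalFeature", ("feature_2",)),
    ("RegulatedBy", ("feature_1", "tf_1")),
    ("RegulatedBy", ("feature_2", "tf_2")),
    ("TranscriptionFactor", ("tf_1",)),
    ("TranscriptionFactor", ("tf_2",)),
    ("hasName", ("tf_1", "Rtg1")),
    ("hasName", ("tf_2", "OtherName")),
]

# Body of the example query; the head is Ans(E).
QUERY_BODY = [
    ("SGDEntry", ("E",)),
    ("describes", ("E", "F")),
    ("ChromosomalFeature", ("F",)),
    ("RegulatedBy", ("F", "R")),
    ("TranscriptionFactor", ("R",)),
    ("hasName", ("R", "Rtg1")),
]

answers = evaluate(("E",), QUERY_BODY, FACTS)
print(answers)
```

In this toy data only sgd_entry_1 satisfies all six atoms, so it is the single answer. In the real system the atoms are answered by different sources rather than by one fact list.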

Conclusions 

We have described YeastMed, an XML and mediator-based
system that integrates five yeast databases which
have the most appropriate properties for studying
Saccharomyces cerevisiae.

Data Services play an important role in the integration
process of this system: each acts as an interface that
receives queries, accesses a data source, extracts data
and translates them into the common data model used by
SB-KOM. In YeastMed, Data Services extract data mainly
from flat files because most of the integrated data
sources are accessible via FTP and provide data in
tabular or XML format. This reduces the maintenance cost
of the system because flat-file structures rarely
change.
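
As a rough illustration of the role described above, the sketch below reads a tab-delimited flat file, keeps the rows that satisfy a simple constraint and maps them to a uniform record. It is a sketch under stated assumptions, not the YeastMed implementation: the Record class is a hypothetical stand-in for the common data model, and the column names, the sample rows and the extract function are invented for the example.

```python
import csv
import io
from dataclasses import dataclass

@dataclass
class Record:
    """Hypothetical stand-in for an entry in the common data model."""
    source: str
    entry_id: str
    attributes: dict

def extract(source_name, flat_file, id_column, constraints):
    """Read a tab-delimited file and return rows matching all constraints."""
    reader = csv.DictReader(flat_file, delimiter="\t")
    records = []
    for row in reader:
        # Keep the row only if every requested column has the wanted value.
        if all(row.get(col) == val for col, val in constraints.items()):
            attrs = {k: v for k, v in row.items() if k != id_column}
            records.append(Record(source_name, row[id_column], attrs))
    return records

# Invented sample data; a real Data Service would open a downloaded file.
SAMPLE = (
    "id\tname\ttype\n"
    "entry_1\tExampleA\tORF\n"
    "entry_2\tExampleB\ttRNA\n"
)

records = extract("toy_source", io.StringIO(SAMPLE), "id", {"type": "ORF"})
for record in records:
    print(record)
```

Because the function depends only on the column header, a new column added to the file would not require any change here, which is one way to read the maintenance argument made above.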

In our system, the schema integrator is an ontology 
and the results are ontology instances. The use of the
ontology and instances enables basic reasoning processes
(class-subclass inference) to be included later.
This will permit YeastMed to infer new relationships
between the instances of the ontology when solving a
user query and thus discover new knowledge for the
query answers. The final result is an ontology instance
that includes all the data extracted from the integrated
data sources. It is converted to HTML before
being presented to users.
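
The kind of class-subclass inference mentioned above can be sketched as follows. This is only an illustration of the idea, since such reasoning is described here as a later addition rather than a current feature; the class hierarchy (Gene under ChromosomalFeature under BiologicalEntity) and the instance name feature_1 are hypothetical and are not taken from the YeastMed ontology.

```python
# Hypothetical hierarchy: each class maps to its direct superclass.
SUBCLASS_OF = {
    "Gene": "ChromosomalFeature",
    "ChromosomalFeature": "BiologicalEntity",
}

def superclasses(cls, subclass_of):
    """Return all ancestors of `cls`, nearest first, guarding against cycles."""
    ancestors = []
    while cls in subclass_of and subclass_of[cls] not in ancestors:
        cls = subclass_of[cls]
        ancestors.append(cls)
    return ancestors

def inferred_types(asserted, subclass_of):
    """Map each instance to its asserted class plus every inferred superclass."""
    return {
        instance: {cls, *superclasses(cls, subclass_of)}
        for instance, cls in asserted.items()
    }

# An instance asserted only as a Gene is also inferred to be a
# ChromosomalFeature, so it could answer a query on that broader class.
types = inferred_types({"feature_1": "Gene"}, SUBCLASS_OF)
print(types)
```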

The objectives set for the YeastMed system are not yet
all met. The system is still at an early stage and
additional work is under way to improve it. The
system does not yet make all of its ontology available to
users when formulating queries, because it is not
yet able to answer queries expressed in terms of some
parts of the ontology. In addition, the fact that the
system answers only conjunctive queries limits what users
can express; i.e. it is not able to answer queries
that use disjunction (the union operator, denoted ∪).